Project 4: Natural Language Processing - First Year Project 2021

Working with natural language data


Group 8: Ida Maria Zachariassen, Magnus Sverdrup, Rasmus Bondo Hansen, Ruben Oliver Jonsman and Sabrina Fonseca Pereira

This notebook contains all the code developed for Project 4 - Natural Language Processing

Contact/Group:

Introduction

The project presented in this notebook was developed with the purpose of predicting the intention or state of mind (pragmatics) expressed in social media data from Twitter.

The data came from the TweetEval corpus, a collection of 7 datasets for different classification tasks. Each task has train, validation and test files consisting of one tweet per line, with the corresponding labels in a separate file. Given a tweet, we were to predict its label using a model trained on tokens from our data.

More precisely, this project aims to predict labels for the binary classification task Irony and the multi-class classification task Stance, specifically Atheism. Irony tweets are labelled either 0 (non_irony) or 1 (irony), and Atheism tweets are labelled 0 (none), 1 (against) or 2 (favor). To test the possibility of predicting stance further, all topics in the Stance task were gathered to see if it was possible to detect both topic and stance, giving a total of 15 labels.


Required Libraries

Consistent variables

Paths to the data

Dataframes

Irony

Split irony training dataset into two subsets. One for creating our tokenizers and one for evaluating them

Stance

Abortion
Atheism
Climate
Feminist
Hillary

Manual annotation answers for the irony dataset

1. Preprocessing

Tokenization is the task of splitting a string of characters into minimal processed units, also called tokens, which will be the input in our machine learning solutions. As a starting point we aimed at segmenting the lines at “words” and turned to discuss the significance of written language on social media platforms like Twitter.

We saw that our ideal tokenizer should:

Further interesting tokenizers were:

We did this using the RegEx module.

We lastly compared the output of our final tokenizer with a baseline tweet tokenizer from the NLTK library. To investigate this we used the difflib library, with which each token is compared and the differences between them are displayed.

Tokenizers

We have created a few different tokenizers. However, to illustrate our work we will stick with our "tokenize_ideal" tokenizer through most of the notebook. At the end, once we have created our ML models to predict stance and irony, we will compare the different tokenizers and evaluate which one is best.

Here is a small sample when using our "tokenize_ideal" function:
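The tokenizer's code cell is not reproduced in this export; as a hedged illustration, a regex-based tweet tokenizer in the same spirit might look like the sketch below. The function name `tokenize_ideal_sketch` and its pattern are assumptions, not the actual tokenize_ideal implementation.

```python
import re

# Hypothetical pattern: keep mentions and hashtags whole, keep
# contractions like "it's" together, split off punctuation marks.
TOKEN_PATTERN = re.compile(r"#\w+|@\w+|\w+(?:'\w+)?|[^\w\s]")

def tokenize_ideal_sketch(tweet):
    """Lower-case the tweet and split it into word, mention, hashtag and punctuation tokens."""
    return TOKEN_PATTERN.findall(tweet.lower())

print(tokenize_ideal_sketch("@user Sooo happy it's #Monday again..."))
# → ['@user', 'sooo', 'happy', "it's", '#monday', 'again', '.', '.', '.']
```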

Comparison with baseline tokenizer

To understand how general tokenizers work, and to gauge the quality of our own, we wanted to compare our tokenizers with a baseline tokenization tool. Here we use the social media tokenizer TweetTokenizer from the nltk library: https://www.nltk.org/api/nltk.tokenize.html. Since we set aside a small part of the training dataset in the beginning, we now have some tweets on which to evaluate and compare our tokenizers.

To see the difference quantitatively we can use the method SequenceMatcher from the difflib library: https://docs.python.org/3/library/difflib.html
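A minimal sketch of this comparison on a single pair of tokenizations (the token lists below are made-up stand-ins; the notebook runs this over the held-out tweets):

```python
from difflib import SequenceMatcher

# Two hypothetical tokenizations of the same tweet: ours lower-cases,
# the baseline keeps the original casing.
ours = ["@user", "sooo", "happy", "it's", "#monday"]
nltk_baseline = ["@user", "Sooo", "happy", "it's", "#Monday"]

# ratio() = 2 * (number of matching tokens) / (total tokens in both lists)
ratio = SequenceMatcher(None, ours, nltk_baseline).ratio()
print(f"agreement: {ratio:.2f}")  # → agreement: 0.60
```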

Here we see that on average our tokenizer and the NLTK baseline tokenizer agree 65% of the time. However, neither of the two methods is perfect, and this 65% is not inherently good/bad or high/low. Alternatively, we can also look at the actual differences between the two tokenizers.

Here we again use the difflib library, now a method called unified_diff, which returns the areas where the differences occur. As an example we only look at the first tweet.
"+" means that the word is present in the output from the baseline tokenizer but not in the output from our tokenizer.
"-" means that the word is not present in the output from the baseline tokenizer but it is in our tokenizer.

It can however be seen that the comparison is case-sensitive, which causes the biggest difference, as we decided to lower-case everything in our tokenizer.
To summarize, our chosen tokenizer actually removes more than the baseline. In the end we will conclude whether this makes a difference when predicting.

2. Characterising Your Data

We create a function to quickly report the number of lines, words and characters, in case the Unix commands don't work on macOS.

To simplify our process of finding statistics we create a function to quickly create a vocabulary for a given dataset.
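A minimal sketch of such a vocabulary helper (the function name is an assumption; the notebook's version may also build a dataframe):

```python
from collections import Counter

def build_vocabulary(tokenized_tweets):
    """Count how often each token occurs across a list of tokenized tweets."""
    vocab = Counter()
    for tokens in tokenized_tweets:
        vocab.update(tokens)
    return vocab

vocab = build_vocabulary([["user", "lol", "user"], ["lol", "#irony"]])
print(vocab.most_common(2))  # the most frequent tokens first
```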

Irony

Corpus, vocabulary and token-ratio

Corpus size

Count of tweets per label

Token-ratio

First tokenizing all the tweets in the Irony class using the tokenize_ideal function, then creating a corresponding vocabulary and the unique token frequency.

Most frequent tokens in our vocabulary and their corresponding occurrences.

Number of least-occurring tokens and examples thereof.

Frequency table

Accumulating the frequency for each word in the vocabulary dataframe.

Plotting the cumulative count for the tokens.

Illustration of Zipf's law

"given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word.", this quote is taken from Wikipedia. With this law, we can expect that very few words make up the biggest part of any given context. To verify this claim, one can in a log-log transformed system plot the frequency of each token against the tokens rank. This should create something close to a straight line.

Therefore we add two new columns to our vocabulary dataframe, containing the log-transformed values of the frequency and the rank.

By plotting the log frequency and the log rank we see an almost linear relationship between the tokens.
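The log-log check can be sketched as follows; the frequency values below are hypothetical stand-ins for the vocabulary dataframe, and the fitted slope is what Zipf's law predicts to be close to -1:

```python
import numpy as np

# Hypothetical descending token frequencies; in the notebook these come
# from the vocabulary dataframe built earlier.
freqs = np.array([1668, 900, 500, 120, 45, 10, 3, 1], dtype=float)
ranks = np.arange(1, len(freqs) + 1)

log_rank = np.log(ranks)
log_freq = np.log(freqs)

# Fit a straight line in log-log space; a clearly negative slope
# (ideally near -1) is the signature of Zipf's law.
slope, intercept = np.polyfit(log_rank, log_freq, 1)
print(f"slope: {slope:.2f}")
```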

N-Grams

To investigate the relationship between words, we are going to use n-grams. This method is explained through the Markov assumption: "Each element of the sequence depends only on the immediately preceding element and is independent of the previous history". With this we can define the k'th order Markov assumption: "Each element of the sequence depends only on the k immediately preceding elements."

With this tool we can estimate, for every unique word in a given context, its probability given the k preceding words. When we are able to predict words, we are also able to predict the meaning or context of new sentences. As mentioned, we will be using the n-gram implementations from the Python library NLTK.

To begin we are going to look at a bigram. To prepare every tweet for the n-gram model, we have to pad it, meaning we put an "&lt;s&gt;" (start of sentence) symbol at the front and an "&lt;/s&gt;" (end of sentence) symbol at the end. This can give meaning to words that tend to appear more often at the beginning or the end.

Here we see a "bigram", which is just a n-gram with degree of 2. Meaning we are only looking 1 word back in the sentence, for every word. For example the word "walking" is in this case being related to the preceding word "ppl".

This is an example of a trigram, an n-gram of degree 3. Here we look at and condition on the 2 preceding words.

We can also combine the bigram and trigram to create everygrams. This way we also include the lower degrees of grams. If we, for example, use an everygram of degree 3, we not only create the trigram but also include the bigram and unigram information.

Now with the tweets prepared we can begin to train a model. For this we have created a small function that takes the tokenized tweets and the degree of the everygram. Here we use the pipeline function from NLTK, which handles the padding for us.
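A self-contained sketch of such a training function using NLTK's `padded_everygram_pipeline` (the tiny corpus and the helper's name are assumptions; the notebook trains on the irony training set):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Tiny hypothetical corpus of tokenized tweets.
tokenized = [["just", "love", "mondays"], ["just", "love", "rain"]]

def train_everygram_model(tokenized_tweets, degree):
    """Pad each tweet, build everygrams up to `degree` and fit an MLE model."""
    train_data, vocab = padded_everygram_pipeline(degree, tokenized_tweets)
    model = MLE(degree)
    model.fit(train_data, vocab)
    return model

model = train_everygram_model(tokenized, 2)
# P("love" | "just"): "love" follows "just" in both tweets, so 1.0.
print(model.score("love", ["just"]))
# Perplexity over a list of bigrams; lower means less "confused".
print(model.perplexity([("just", "love")]))
```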

We train two models, one with degree 3 and one with degree 2

To validate the models we can look at how confused they are when predicting words. Here we use the NLTK perplexity function: the higher the score, the more options the model has to choose from, meaning it is unsure what is most correct. To test this we feed the first sentence of the irony training dataset to the two models.

Both models have an okay (low) perplexity for some of the words, but a high score for others, meaning they are very confused about what to predict there.

Lastly we can also use these models to generate sentences based on the training data. The generated data is often hard to make sense of, or is just a copy of a tweet. This is probably due to the model being quite confused, or to there being only one possible prediction for that specific case. In NLP we have to keep in mind that our training data is a small dataset, and therefore we would expect low performance.

Maximum likelihood

Maximum likelihood estimation is about estimating the probability of a word given its preceding word: what is the probability of x given y? This is simply the number of times y is followed by x in the corpus divided by the number of times y appears in the corpus:

$$p(w_2|w_1) = \frac{\text{count}(w_1w_2)}{\text{count}(w_1\bullet)}$$


To illustrate the maximum likelihood principle, and the problems with it, we create a subset of the irony training dataset. We tokenize and pad these tweets, create their bigrams and put them into a list.

We then create a dictionary that stores every preceding word for every unique token in the corpus.

Lastly we count all the times the specific x appears after y and divide by the number of times y appears in the corpus.
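The three steps above can be sketched as follows; the padded bigrams are a toy stand-in (including the "user"/"feat" tokens discussed next), and the helper name `mle` is an assumption:

```python
from collections import defaultdict

# Toy padded bigrams; "<s>"/"</s>" mark sentence boundaries as above.
bigrams = [("<s>", "user"), ("user", "user"), ("user", "</s>"),
           ("<s>", "feat"), ("feat", "user"), ("user", "</s>")]

# For every token w1, count how often each token w2 follows it.
following = defaultdict(lambda: defaultdict(int))
for w1, w2 in bigrams:
    following[w1][w2] += 1

def mle(w2, w1):
    """P(w2 | w1) = count(w1 w2) / count(w1 followed by anything)."""
    total = sum(following[w1].values())
    return following[w1][w2] / total if total else 0.0

print(mle("user", "feat"))  # "feat" is always followed by "user" → 1.0
print(mle("user", "user"))
```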

We can then see all the probabilities of the preceding words for the token "user". For example, the probability of "user" given "user" is 34.5%, and the probability of "user" given "feat" is 100%. Meaning that if our model wants to predict what word comes after the token "feat", it will with 100% certainty predict the token "user".

The reason the model predicts "user" as following "feat" with 100% certainty is that "feat" only appears once in the corpus. This is obviously not good, since we know there are many more possibilities for words that can appear after "feat" than just "user" in future tweets. To tackle this problem we will discuss smoothing techniques next.

Smoothing Techniques

Smoothing techniques in NLP are used to estimate the probability of a sequence of words (say, a sentence) occurring together when one or more of its units - unigrams, bigrams $(w_i \mid w_{i-1})$ or trigrams $(w_i \mid w_{i-1}w_{i-2})$ - have never occurred in the given corpus.

Kneser–Ney smoothing

Kneser–Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories. It is widely considered the most effective method of smoothing due to its use of absolute discounting by subtracting a fixed value from the probability's lower order terms to omit n-grams with lower frequencies.

Relying only on the unigram frequency to predict the frequencies of n-grams leads to skewed results; Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to the possible words preceding it.
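NLTK ships a Kneser-Ney language model with the same pipeline as the MLE model above; a minimal sketch on a toy corpus (the corpus is an assumption, and the default discount is used):

```python
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

tokenized = [["just", "love", "mondays"], ["just", "love", "rain"]]

# Same pipeline as for the MLE model, but with Kneser-Ney smoothing, so
# unseen bigrams keep a small, non-zero probability.
train_data, vocab = padded_everygram_pipeline(2, tokenized)
model = KneserNeyInterpolated(2)
model.fit(train_data, vocab)

print(model.score("love", ["just"]))  # seen bigram: high probability
print(model.score("rain", ["just"]))  # unseen bigram: small but non-zero
```

Under plain MLE the second score would be exactly zero; the smoothed model instead backs off to the continuation probability of "rain".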


Custom word embedding

Information about word embedding was mostly found here: https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/
The code for the word embedding implementation was greatly inspired by this web post: https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/

Every word can be represented by a vector. Word embeddings are N-dimensional vectors that try to capture word-meaning and context in their values. For the vectors to be useful, they should for a vocabulary capture the meaning of the words, the relationship between words, and the context of different words as they are used naturally.

To capture this we can encode different meanings to the words, and therefore represent each word as an N dimensional vector. Then the relation between each word is measured by the euclidean distance.

If one manages to create meaningful encodings and assign the values of each word well, it can help machine learning models: words that were not seen during training, but are known in the word-embedding space, will not cause the model to fail. For example, if a model is trained to recognise vectors for "car", "van", "jeep" or "automobile", it will still behave well on the vector for "truck" due to the similarity of the vectors. Word embedding only works well with a lot of data, therefore we need more data than just the irony training data. For the demonstration of word embedding we use all the training data from the "emotion", "hate", "irony", "offensive" and "sentiment" datasets, as well as the test data from the "sentiment" dataset (because it provided a lot of data).

With all this data combined we now have 75000 tweets. We then tokenize the tweets.

We train our Word2Vec model with the tokenized master dataset. We set our vectors to have a dimension of 100, set the context window to be 5 words and exclude words that only appear once.

We see a vocabulary of 35206 unique words, each of which appears more than once in the master dataset.

Now we can see the top words that are supposed to have a similar context - often of the same lexical category. One would expect the word "good" to lie next to words like "nice" and "great", and indeed our model predicts this. However, it also predicts words with the opposite meaning, like "bad" and "hard", as these could also replace "good" and the sentence would probably still make sense.

Stance: Atheism

Corpus, vocabulary and token-ratio

Corpus size

Count of tweets per label

Token-ratio

Tokenizing the tweets of the Stance: Atheism class and creating the corresponding vocabulary as well as the token count.

Most frequent tokens and their corresponding count in the vocabulary for the Stance: Atheism class.

Number of least-occurring tokens and examples thereof.

Accumulating the frequency for each word in the vocabulary dataframe.

Plotting the cumulative count for each word.

Illustration of Zipf's law

By plotting the log frequency and the log rank we see an almost linear relationship between the tokens.

Comparing the datasets

Commenting on the differences between the binary-class and the multi-class dataset:
mean words per tweet
most frequent words in each
how many words occur once, twice, etc.

First looking at the average number of characters and words in each class.

In the evaluated dataset, people tweeting about religion (or not being religious) wrote longer tweets than people expressing irony.

Here we see that most of the top words beyond the top 5 are largely the same, even when comparing completely different topics. This could mean that the context of a tweet is carried by very few words, and the rest are just filler words.

As Twitter usernames were all changed to 'user', we can see a big difference in user mentions between the two datasets. For irony the token 'user' ranks number 1 with a frequency of 1668, and for atheism it ranks 7th with a frequency of 135.

Even accounting for the difference in corpus size, we can still see that user mentions are much more frequent in the irony dataset.

Differences in balance/imbalance

The irony dataset is mapped as follows:

The atheism dataset is mapped as follows:

As we can see above, the irony dataset is almost perfectly balanced, unlike the atheism dataset, where we see a big imbalance between classes.

And as there are many more tweets against atheism than neutral or in favour, this imbalance can become an issue: the predictive model would be biased towards predicting 'against', and to a lesser degree 'none', over 'favor'.

3. Manual Annotation and Inter-Annotator Agreement

Smaller datasets, each consisting of 120 randomly selected tweets from each task's training set, were provided for manual annotation and the inter-annotator agreement thereof. The chosen task for the annotations was Irony. Before annotating, a scheme was agreed upon based on the research paper "SemEval-2018 Task 3: Irony Detection in English Tweets" from the SemEval workshop, which describes how each of the original datasets was created and annotated. There it is stated that all annotations were done using the brat rapid annotation tool, a web-based tool for NLP-assisted text annotation, specifically following the report "Guidelines for Annotating Irony in Social Media Text". From this, we agreed on some definitions of irony:

Irony is the use of words to express something other than and especially the opposite of the literal meaning.

Meaning we were first and foremost to annotate a tweet as ironic if it was ironic by means of a literal-meaning clash.
However, as the guide describes, there are different forms of irony, and an instance can therefore contain another form of irony where there is no polarity clash but the tweet is still found ironic, e.g. situational irony.

Therefore, when investigating the tweets, we were to think about whether the tweet was:

Working independently, each group member manually went through the sample and labelled the tweets according to the agreed-upon scheme. Next, the inter-annotator agreement (IAA) coefficients were computed using the nltk.metrics.agreement module.
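A minimal sketch of how that module is fed: it takes (coder, item, label) triples. The toy triples below are made up (two coders, four items); the real data holds 120 tweets labelled by the whole group:

```python
from nltk.metrics.agreement import AnnotationTask

# Hypothetical annotation triples: (coder, item, label).
triples = [
    ("c1", "t1", "1"), ("c2", "t1", "1"),
    ("c1", "t2", "0"), ("c2", "t2", "1"),
    ("c1", "t3", "0"), ("c2", "t3", "0"),
    ("c1", "t4", "1"), ("c2", "t4", "1"),
]

task = AnnotationTask(data=triples)
print(round(task.avg_Ao(), 2))  # average observed agreement → 0.75
print(round(task.kappa(), 2))   # Cohen's kappa → 0.5
```

The same `AnnotationTask` object also exposes `S()`, `pi()` and `alpha()` for the other coefficients discussed below.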

Individual scores.

Confusion matrix for the annotations.

First taking a look at the IAA coefficients when not including the truth labels.

Now, when including the true labels as an annotator, we get the following IAA coefficients:

It is important to remember that the original annotations were also done by humans. They might have had more training and taken more time over the annotations, but they are not necessarily "true annotations". We therefore include them, as they are an indicator of the difficulty of carrying out annotation tasks consistently.

Finding the pair-wise kappa scores, mostly to make a figure for the report and to see the distribution of agreement.

(Landis and Koch, Biometrics 1977)

Agreement Without Chance Correction
The group got an average observed agreement of 0.60, which according to the rule of thumb given by the Landis and Koch scale is on the edge between a moderate and a substantial level of agreement. The observed agreement between the group members can, however, partly be due to chance, as the Irony dataset only has two possible labels, making the probability of agreeing on a label by chance high.

$$A_O = \frac{\text{no. matches}}{\text{no. total items}}$$

In the NLTK library we used the function avg_Ao(), which computes the "average observed agreement across all coders and items."
A general problem with observed agreement is its bias in favor of dimensions with a small number of categories. This also makes figures hard to compare across studies of dimensions with different numbers of categories, as they must be adjusted for chance agreement to be comparable.

Chance-corrected Agreement
The three best-known coefficients, $S$ (Bennett, Alpert, and Goldstein 1954), $\pi$ (Scott 1955), and $\kappa$ (Cohen 1960), are in their basic form used to measure agreement between two coders. The NLTK library, however, made it possible for us to use them with more than two coders.

The above mentioned coefficients use the following formula: $$ S, \pi, \kappa = \frac{\text{observed agreement} - \text{expected agreement}}{1 - \text{expected agreement}}$$

The difference between $S$, $\pi$ and $\kappa$ lies in the assumptions to the calculation of the chance of coder $c_i$ assigning an arbitrary item to category $k$, as follows:

From: Inter-Coder Agreement for Computational Linguistics

From the NLTK library these functions were used to compute the following, with the corresponding scores as a result:

Inter-annotator agreement problematics

Common reasons for low inter-annotator agreement scores are:

In general, inter-annotator agreement gives an indication of how well-defined and reproducible a task is. In this case it shows that detecting irony in writing, especially in a text as short as a tweet, is difficult even for the human reader. When the group discussed the annotation of the dataset afterwards, it was agreed that context was often lacking. As mentioned earlier, present irony was not necessarily a polarity clash but could also be another type of irony, possibly making the guidelines too loose and causing diverging interpretations of when something is ironic.

4. Automatic Prediction


This section is split into two parts: binary classification on the Irony dataset, and multiclass classification on the Stance dataset. The multiclass classification is further split into two parts: classification on a single stance (Atheism), and classifying whether a tweet expresses a specific stance with a model trained on all of the different stances at the same time.

Binary Classification on Irony Dataset

Tokenizing data

Using the tokenize_ideal function from earlier to tokenize the data and convert it to a usable datatype.

Loading the train labels

The data is balanced, thus we try to optimize the accuracy score.

Creating pipeline

Creating the pipeline to make it easier to preprocess, train/fit and predict on data. Making it a function reduces duplicate code.
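A hedged sketch of such a pipeline function with scikit-learn (the helper name, classifier choice and toy data are assumptions; the notebook's version may differ). The identity tokenizer/preprocessor stop CountVectorizer from re-tokenizing, so our own tokenizer's output is used as-is:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def make_text_pipeline(classifier):
    """Vectorize already-tokenized tweets and attach a classifier."""
    return Pipeline([
        ("vect", CountVectorizer(tokenizer=lambda toks: toks,
                                 preprocessor=lambda toks: toks,
                                 token_pattern=None)),
        ("clf", classifier),
    ])

# Toy, perfectly separable data standing in for the tokenized tweets.
tweets = [["not", "ironic", "at", "all"], ["sooo", "ironic"]] * 10
labels = [0, 1] * 10
pipe = make_text_pipeline(LogisticRegression(max_iter=1000))
pipe.fit(tweets, labels)
print(pipe.predict([["sooo", "ironic"]]))
```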

Baseline model

Training our baseline model on our tokenized training corpus.

An accuracy score of 0.556 is not the best.

Different classification models

Training different classifying models without any special tuning to see which is the best in its simplest form.

Comparison

Finding the different metrics that are usable and give insight into how good a specific model is. Here we mainly look at the accuracy score and the F1 score, since recall and precision can each be "gamed" individually but not both at the same time; the F1 score therefore gives credible insight into both recall and precision, and thus an overall performance score for the model.

Accuracy is the proportion of elements classified correctly:

$$Accuracy = \frac{\text{sum of diagonal}}{\text{total sum}}$$

Accuracy can be very misleading in unbalanced datasets.

Precision: Out of the examples we predicted to be in a certain class, how many of them are correct?

$$ Precision=\frac{\text{single diagonal element}}{\text{sum of a single column} } $$

Recall: Out of the examples that actually belong to a certain class, how many of them did we find?

$$ Recall = \frac{\text{single diagonal element}}{\text{sum of a single row} } $$

F-score: Harmonic mean of Precision and Recall

$$ F_{score} = 2 \cdot \frac{P\cdot R}{P+R} $$
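The four metrics above map directly onto scikit-learn functions; a sketch on hypothetical binary predictions (the label vectors are made up):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical predictions for the binary irony task.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print(accuracy_score(y_true, y_pred))   # correct / total
print(precision_score(y_true, y_pred))  # of predicted ironic, how many were
print(recall_score(y_true, y_pred))     # of truly ironic, how many we found
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```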

Now doing cross-validation on the validation dataset and taking the mean of the cross-validation-scores, we can find the overall best models for our task.

Here we see that none of our models perform particularly well, though we still have some models performing better than others.

Tuning the best models

Now we try to find the best possible combination of parameters to achieve the highest F-score.

Starting with the Random Forest Classifier

Random Forest Classifier

Initializing a simple RFC model (no parameters), that can be tweaked

Testing all possible combinations of chosen parameters and finding the F-score (very time consuming)
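This exhaustive search is what scikit-learn's GridSearchCV does; a small sketch on synthetic data (the parameter grid and the `make_classification` stand-in are assumptions; the notebook searches over the real vectorized tweets, which is why it is so time consuming):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Small synthetic stand-in for the vectorized tweet matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Every combination of these parameters is fitted and scored by F1.
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 2))
```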

Support vector

Initializing a simple Support Vector model (no parameters), that can be tweaked

Testing all possible combinations of chosen parameters and finding the F-score (very time consuming)

Logistic Regression
Final model on the basis of the validation data

We see that the models that we have tuned to have the highest accuracy score are all equally bad at classifying whether a given sentence is ironic.

It is logical that our model is really bad at classifying text-based irony, since the group members are also very poor at classifying irony. Thus we cannot expect the model to massively outperform us, since it lacks world knowledge, general experience and the emotional aspect of understanding the user's intentions. Perhaps combining an intention model and an irony model could increase the performance of the irony model.

The final model is LogisticRegression, since it has a higher accuracy score.

Final model on test data metrics

Multi-Classification on Stance Dataset

First multi-classification: mapping all the different stances to a 3x5 grid taking values from 0-14 and classifying on the basis of these 15 different labels.

Creating the necessary dataframes

Tokenizing our training and validation tweets

Here we see that the data is not balanced.

Baseline model

Different classification models

Training different classifying models without any special tuning to see which is the best in its simplest form.

Here the precision score will be flawed, since our models do not predict all labels in the dataset; e.g. the NaiveBayes only predicts 5 out of the 15 different labels in our data, which makes calculating the precision score for the non-predicted labels impossible. This is also the reason warnings are returned.

In our validation data, we have a label that only occurs twice, which means that we cannot compute proper cross-validation scores for it, as we simply do not have enough data to support any conclusions. This is also the reason a lot of our classifiers ignore the classes with few members: we train on data that does not have many members of those classes.

This cross-validation is thus flawed, since a lot of our models ignore many of the classes in our data; the NaiveBayes, for example, ignores more than half of them. This makes it a very bad model for classifying stance, even though at first glance it looks to have a very high cross-validation score, which is very misleading.

Tuning the best models

We then decide to try to optimize the models that actually predict most of the classes in our dataset.

Now we try to find the best possible combination of parameters to achieve the highest F-score.

Logistical Regression
Random Forest Classifier
KNearestNeighbor
Final model

Here we see that the KNN is the best model for our multiclass 3x5 stance mapping, since it only excludes one class out of the total 15. The accuracy of the KNN is surprisingly high given the difficulty of a task with this many classes. Also, if one were to guess the class randomly (uniformly), the probability of getting one correct would be $\frac{1}{15}$, which our models beat comfortably. Meaning our models, even though very bad, are actually better than random.

Final model: KNN

Final model on test data metrics

Checking which subjects in the test data it performs well on

By predicting on only one of the validation datasets using the multiclass 3x5 mapping, we can test which classes the model performs well and poorly on.

Here we cannot look at the precision and recall scores, since they would take into account all of the different classes that we are not trying to predict.

Here we see that the model varies a lot on the different subject stances.

Multi-classification on a single stance subject (Atheism)

Necessary Datasets

Tokenizing our training and validation tweets

The data is imbalanced, thus we use macro-averaged recall.

Baseline model

Different classification models

Training different classifying models without any special tuning to see which is the best in its simplest form.

Tuning the best models

Random Forest Classifier
Support Vector Classifier
Logistical Regression

Here we again see that the logistic regression is the best on all the metrics, thus it is our final model.

Final model on test data metrics

Evaluating our tokenizers on the final ML atheism model

Ideal tokenizer

Tokenize extra

NLTK tokenizer

no tokenizer

To compare the tokenizers with the scenario where we do not use a tokenizer

Evaluation of tokenizers

We see that all the tokenizers produce different results.

Here we see that the tokenizer we identified as the best and most appropriate for this project performs roughly on par with the baseline tokenizer. Oddly enough, they both perform about as well as the raw data. If we were to use the tokenizer built into CountVectorizer we would see an increase in our metrics; thus, if we were doing this project for real and the rule of using our own tokenizer did not exist, we could achieve higher accuracies.

5. Conclusion

For further studies much more data would be necessary for word embedding and to improve our models in general, especially more data in the stance files.

Binary classification for Irony:

Final model: LogisticRegression

 - accuracy score:       0.5535714285714286
 - recall score:        0.6688102893890675
 - precision score:       0.45714285714285713
 - f1 score:       0.5430809399477806

Overall, the tuned models with the highest accuracy scores were all equally bad at classifying whether a given sentence is ironic. Since the group members also performed badly at classifying irony, we cannot expect the model to massively outperform us, as it lacks world knowledge, general experience and the emotional aspect. Combining an intention model and an irony model could possibly increase the performance of the irony model.

Multi-class classification for Atheism:

Final model: LogisticRegression

 - classes not predicted by model:      []
 - accuracy score:       0.6409090909090909
 - recall score:        0.3354166666666667
 - precision score:       0.3465909090909091
 - f1 score:       0.33073166826769135

The model for classifying stance in the Atheism dataset performed best of the bunch. In the end the LogisticRegression model was chosen, as it included all labels, unlike some of the other models, and had the overall highest scores. The distribution looked fairly linear, and we had to use macro-averaged recall because the data was imbalanced - which was likely also the cause of some of the models leaving out classes when predicting.

Multi-class classification for all stances:

Final model: KNN

 - classes not predicted by model:      []
 - accuracy score:       0.21857485988791034
 - recall score:        0.1293985722138177
 - precision score:       0.1290124308772997
 - f1 score:       0.12514616152456073

The KNN was seen as the best model, since it only excluded one class out of the total 15 on the training data. Given the difficulty of a task with this many classes, the accuracy of the KNN is surprisingly high - better than random, yet still bad. It is safe to say that the small amount of training data for some classes (imbalance) combined with many labels to predict from is not promising.

Evaluating tokenizer

Our final tokenizer performed roughly on par with the baseline tokenizer. If doing this project for real, higher accuracies could be achieved by using the tokenizer built into CountVectorizer, as this would increase our metrics.